[Auto Parallel] fix enable_delay_scale_loss for static auto parallel … #68525
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
force-pushed from daab76a to 1bed945
LGTM
@@ -636,6 +636,94 @@ def parse_program(
    return grad_to_gradient_merge


def _find_trival_optimizer_ops(block):
Detecting optimizer ops here by name string alone makes it easy to miss cases in the future; later we might consider maintaining a fixed opt_op_name_list in one place.
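A minimal sketch of the reviewer's suggestion, assuming a hypothetical list and helper (`_OPT_OP_NAME_LIST`, `_is_optimizer_op`) that are not part of Paddle's actual code:

```python
# Hypothetical sketch: keep optimizer op names in one fixed list instead of
# scattering ad-hoc name-string checks. The list contents below are assumptions.
_OPT_OP_NAME_LIST = ["sgd", "momentum", "adam", "adamw", "lamb"]


def _is_optimizer_op(op_name: str) -> bool:
    # Match both plain op names ("adamw") and dialect-prefixed ones ("pd_op.adamw").
    return any(
        op_name == name or op_name.endswith("." + name)
        for name in _OPT_OP_NAME_LIST
    )
```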
…&& fix sharding degree
force-pushed from 5c6d75d to d7c8913
LGTM
Looking at this API on its own, adding a parameter with a default value is a backward-compatible upgrade.
A corresponding change has also been made in auto_trainer: PaddlePaddle/PaddleNLP#9217. Users who call it directly will not run into problems either.
ok, thanks
LGTM
…&& fix sharding degree (PaddlePaddle#68525)
PR Category
Auto Parallel
PR Types
Bug fixes
Description
Fix the behavior of enable_delay_scale_loss in dynamic- and static-graph semi-auto parallel. In auto parallel, the enable_delay_scale_loss logic is used by default.
In the manual dynamic-graph enable_delay_scale_loss logic, the grads are first reduced across the sp/dp/sharding parallel groups, and the reduced result is then divided by the number of accumulation steps. In the current auto parallel implementation, however, each accumulated grad is first divided by the number of accumulation steps and then reduced across the sp/dp/sharding parallel groups. This approach risks losing numerical precision when the grads are small.
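A plain NumPy sketch (illustrative only, not Paddle code; the shapes, dtype, and values are made-up assumptions) of why scaling before the reduction can lose precision for small low-precision grads:

```python
import numpy as np

acc_steps, ranks = 32, 8
rng = np.random.default_rng(0)
# Small per-rank accumulated gradients kept in float16.
per_rank_grads = rng.normal(0.0, 1e-4, size=ranks).astype(np.float16)

# Current static auto parallel order (before this PR): scale each rank's grad
# by 1/acc_steps first, then reduce (sum) across the parallel group.
# The scaled values can fall into the float16 subnormal range and lose bits.
scale_then_reduce = (per_rank_grads / np.float16(acc_steps)).sum(dtype=np.float16)

# Manual dynamic-graph order, adopted by this PR: reduce (sum) across the
# group first, then scale the reduced result once.
reduce_then_scale = per_rank_grads.sum(dtype=np.float16) / np.float16(acc_steps)

# float64 reference for comparison.
reference = per_rank_grads.astype(np.float64).sum() / acc_steps
print(abs(scale_then_reduce - reference), abs(reduce_then_scale - reference))
```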
Therefore, this PR:
(1) Adapts the dynamic-graph auto parallel logic so that the communication is forced to happen before the optimizer update, and the gradients are scaled only afterwards.
(2) Adapts the static-graph auto parallel logic in auto_parallel_gradient_merge_pass so that the grad scaling is moved to after the reduce communication (see the sketch after this list).
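A schematic plain-Python sketch of the reordering described in (2); the tiny op-list "program", the op names ("scale", "allreduce_sum", "adamw"), and the helper function are illustrative assumptions, not the real auto_parallel_gradient_merge_pass:

```python
def reorder_scale_after_reduce(ops):
    """ops: list of (op_type, grad_name) tuples in program order."""
    reordered, deferred = [], {}
    for op_type, grad in ops:
        if op_type == "scale":
            deferred[grad] = (op_type, grad)      # hold the 1/acc_steps scale back
            continue
        reordered.append((op_type, grad))
        if op_type == "allreduce_sum" and grad in deferred:
            reordered.append(deferred.pop(grad))  # scale right after its reduce
    reordered.extend(deferred.values())           # scales whose grad had no reduce
    return reordered


program = [
    ("scale", "w@GRAD"), ("allreduce_sum", "w@GRAD"),
    ("scale", "b@GRAD"), ("allreduce_sum", "b@GRAD"),
    ("adamw", "w"), ("adamw", "b"),
]
print(reorder_scale_after_reduce(program))
# [('allreduce_sum', 'w@GRAD'), ('scale', 'w@GRAD'),
#  ('allreduce_sum', 'b@GRAD'), ('scale', 'b@GRAD'),
#  ('adamw', 'w'), ('adamw', 'b')]
```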
Pcard-76459